Assignment 2: Task 2

Author

Olivia Hemond

Harry Potter and the Prisoner of Azkaban

Text Analysis Overview

Data Summary

This analysis uses the text from the book “Harry Potter and the Prisoner of Azkaban” by J.K. Rowling. The story follows Harry through his third year at Hogwarts, as he learns to fight Dementors, sneak around using the Marauder’s Map, and even travel in time! By the end of the story, Harry learns a lot about the mysterious escaped criminal Sirius Black and about his own parents’ past.

Source: Rowling, J. K. Harry Potter and the Prisoner of Azkaban. New York: Arthur A. Levine Books, 1999. Full text here.

Purpose

This analysis had two main goals:

  1. Find and visualize the most common words within each chapter and in the book as a whole

  2. Perform a sentiment analysis to visualize the tone (positive or negative) of each chapter of the book

Analytical Outline

  1. Import & Tidy Text

    • Prepare the data

      • Read in the complete PDF of Harry Potter and the Prisoner of Azkaban

      • Add column for page number

      • Create character strings for each line of text in the whole book

      • Let each line of text be its own row in the dataframe

      • Remove extra whitespaces

      • Add column for chapter number (in numeric format)

    • Extract every individual word

      • Let each word in the text be its own row in the dataframe (associated with the page and chapter from which it came)
    • Remove stop words

      • Stop words are commonly used words that don’t carry significant meaning (e.g., “of”, “a”, “the”, “and”)

      • Remove stop words from this dataset

  2. Find Most-Used Words

    • Calculate the five most-used words in each chapter and visualize them

    • Graph the frequency of different character names being mentioned throughout the book

    • Find the top 100 most-used words in the entire book and visualize them as a word cloud

  3. Perform Sentiment Analysis

    • Use the “afinn” lexicon to assign each word a value on the positive/negative scale

      • -5 being most negative, 5 being most positive
    • For each chapter, take a weighted average of the positivity/negativity scores (weighted by the amount of times each word was used)

    • Visualize how the tone of the book changes in each chapter

Import & Tidy Text

Code
library(tidyverse)
library(here)
library(tidytext)
library(pdftools)
library(ggwordcloud)
library(textdata)

Prepare the text data

Code
### Read in
hp3_text <- pdftools::pdf_text(here('data', 'hp_3.pdf'))
Code
### Wrangle and tidy
hp3_lines <- data.frame(hp3_text) %>% 
  mutate(page = 1:n() - 1) %>% 
  mutate(text_full = str_split(hp3_text, pattern = '\\n')) %>%  # creates character strings of each line
  unnest(text_full) %>%  # make row for each line of text
  select(!hp3_text) %>% # don't need original data column anymore
  mutate(text_full = str_squish(text_full)) # remove any extra whitespace
Code
### Add chapters in separate column
hp3_chapters <- hp3_lines %>% 
  slice(-1) %>% # remove empty first row
  mutate(chapter = ifelse(str_detect(text_full, "CHAPTER"), text_full, NA)) %>% # creates new chapter column
  mutate(chapter = str_remove(chapter, "CHAPTER")) %>% # remove the word "chapter"
  fill(chapter, .direction = 'down') %>% # assign chapter to all entries in each chapter
  mutate(chapter_num = case_when(
    chapter == " ONE" ~ 1,
    chapter == " TWO" ~ 2,
    chapter == " THREE" ~ 3,
    chapter == " FOUR" ~ 4,
    chapter == " FIVE" ~ 5,
    chapter == " SIX" ~ 6,
    chapter == " SEVEN" ~ 7,
    chapter == " EIGHT" ~ 8,
    chapter == " NINE" ~ 9,
    chapter == " TEN" ~ 10,
    chapter == " ELEVEN" ~ 11,
    chapter == " TWELVE" ~ 12,
    chapter == " THIRTEEN" ~ 13,
    chapter == " FOURTEEN" ~ 14,
    chapter == " FIFTEEN" ~ 15,
    chapter == " SIXTEEN" ~ 16,
    chapter == " SEVENTEEN" ~ 17,
    chapter == " EIGHTEEN" ~ 18,
    chapter == " NINETEEN" ~ 19,
    chapter == " TWENTY" ~ 20,
    chapter == " TWENTY-ONE" ~ 21,
    chapter == " TWENTY-TWO" ~ 22
    )) # change written chapter numbers into actual numbers

Get words and wordcount

Code
### Get words
hp3_words <- hp3_chapters %>% 
  unnest_tokens(word, text_full) %>% 
  select(page, chapter_num, word) %>% 
  mutate(word = str_split_i(word, pattern = "'s", 1)) # some have 's, (harry's), want this to count with root word

### Wordcount
hp3_wordcount <- hp3_words %>% 
  count(chapter_num, word)

Remove stop words

Code
hp3_wordcount_clean <- hp3_wordcount %>% 
  anti_join(stop_words, by = "word") 

Most-Used Words

By Chapter

Code
### Top 5 words for each chapter
top_5_words <- hp3_wordcount_clean %>% 
  group_by(chapter_num) %>% 
  arrange(-n) %>% 
  slice(1:5) %>% 
  ungroup()

### Plot
ggplot(top_5_words, aes(x = n, y = word)) +
  geom_col(fill = "#740001") +
  facet_wrap(~as.factor(chapter_num), scales = "free") +
  labs(x = "", y = "") +
  theme_minimal()

Figure 1: The top 5 words used in each chapter in the book. Bar sizes depict the amount of times each word was used.

Many of the most used words are the names of characters, which makes sense given it’s a book with a lot of dialogue and third-person narration. Using character names as a proxy for their relevance in any given chapter, we can track how certain characters appear / disappear from the narrative:

Code
### Look at some key characters over the course of the book
hp3_character_count <- hp3_wordcount_clean %>% 
  filter(word %in% c("hagrid", "lupin", "buckbeak", "snape", "pettigrew", "sirius"))

hp3_words_by_chap <- hp3_wordcount_clean %>% 
  group_by(chapter_num) %>% 
  summarize(word_count = sum(n))

hp3_characters_by_chap <- left_join(hp3_character_count, hp3_words_by_chap, by = "chapter_num") %>% 
  mutate(freq_per_chap = n/word_count)

### Plot
ggplot(hp3_characters_by_chap, aes(x = as.factor(chapter_num), y = freq_per_chap, color = word, group = word)) +
  geom_point() +
  geom_line() +
  labs(x = "", y = "Frequency") +
  facet_wrap(~word, scales = "free_y", nrow = 3) +
  scale_color_manual(values = c("#740001", "#AE0001", "#EEBA30", "#D3A625", "#000000", "darkgreen")) +
  theme_minimal() +
  theme(legend.position = "none")

Figure 2: The frequency of Buckbeak, Hagrid, Lupin, Pettigrew, Sirius, and Snape being mentioned in each chapter of the book. The frequency was calculated by dividing the number of uses of each word by the total number of words in that chapter.

Entire Book

Code
### Top 100 words from whole book
hp3_top100 <- hp3_wordcount_clean %>% 
  group_by(word) %>% 
  summarize(n = sum(n)) %>% 
  arrange(-n) %>% 
  slice(1:100)
Code
### Create wordcloud
ggplot(data = hp3_top100, aes(label = word)) +
  geom_text_wordcloud(aes(color = n, size = n+500), shape = "star") +
  scale_size_area(max_size = 6) +
  scale_color_gradientn(colors = c("#740001","#D3A625","#AE0001")) +
  theme_minimal()

Figure 3: Wordcloud of the top 100 words used in the book, sized by their number of uses.

Sentiment Analysis

Code
### Read in the 'afinn' lexicon
afinn_lex <- get_sentiments(lexicon = "afinn")
Code
### Join the lexicon to our words
hp3_afinn <- hp3_words %>% 
  inner_join(afinn_lex, by = 'word')
Code
### Count the number of words in each chapter assigned to each value (from -5 to 5)
afinn_counts <- hp3_afinn %>% 
  group_by(chapter_num, value) %>%
  summarize(n = n())

### Take a weighted average of the values (using the number of words to weight)
afinn_mean <- afinn_counts %>% 
  summarize(weighted_avg_value = weighted.mean(value, n))

### Plot 
ggplot(data = afinn_mean) +
  geom_col(aes(x = as.factor(chapter_num), y = weighted_avg_value, fill = weighted_avg_value > 0)) +
  labs(x = "Chapter", y = "Average Word Positivity") +
  scale_fill_manual(name = 'Positive?', values = setNames(c('#D3A625','#AE0001'), c(T, F))) +
  theme_minimal() +
  theme(legend.position = "none") 

Figure 4: Sentiment analysis of the average positivity of each chapter in the book. Values above 0 indicate the chapter had an overall positive tone. Values below 0 indicate a negative tone. Average positivity was calculated by weighting the positivity score of words by their amount of useage.

The first portion of the book has some overall positive chapters (like Chapter 4, where Harry returns to school and reunites with his friends) and some more negative chapters (like Chapters 2 and 3, where Harry accidentally inflates his Aunt Marge like a balloon, and then must run away and catch the chaotic Knight Bus). The latter half of the book takes on a much heavier and more negative tone, with the most negative chapter being Chapter 17, where Harry, Ron, and Hermione find themselves caught in the Shrieking Shack amidst a showdown between Sirius Black, Professor Lupin, Professor Snape, and Peter Pettigrew (as the rat Scabbers). The book ultimately ends on a positive note in the final chapter, once Sirius and Buckbeak have safely escaped from their respective death sentences!